@devbyteai
Contributor

@devbyteai devbyteai commented Dec 26, 2025

Summary

Fixes #4388

This PR fixes the incorrect transcription_delay metric calculation when using STT-based turn detection (e.g., Deepgram Flux).


Problem

When using STT turn detection mode, the transcription_delay metric incorrectly shows ~0 seconds instead of reflecting the actual transcription latency.

User-Reported Behavior:

"EOU metrics showing ~0.79 transcription_delay when should reflect actual processing time"

The metric should measure the time between when the user stopped speaking and when the transcript was received, but it was always returning near-zero values.


Root Cause

In audio_recognition.py, the transcription_delay is calculated as:

transcription_delay = max(last_final_transcript_time - last_speaking_time, 0)

The bug was in the STT END_OF_SPEECH handler (line 452), which overwrote _last_speaking_time with time.time():

elif ev.type == stt.SpeechEventType.END_OF_SPEECH and self._turn_detection_mode == "stt":
    ...
    self._last_speaking_time = time.time()  # BUG: Overwrites the value!

Event Timeline in STT Mode (Buggy):

  1. START_OF_SPEECH → _last_speaking_time = time.time() (correct)
  2. FINAL_TRANSCRIPT → _last_final_transcript_time = time.time() (correct)
  3. END_OF_SPEECH → _last_speaking_time = time.time() (BUG - overwrites!)

Since END_OF_SPEECH typically arrives shortly after FINAL_TRANSCRIPT in STT mode, both timestamps become nearly identical, resulting in transcription_delay ≈ 0.
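
The buggy timeline can be reproduced with a small stand-alone simulation. This is only a sketch: `BuggySttTimestamps` and the hard-coded event times are hypothetical and stand in for the real state in audio_recognition.py, which reads time.time() rather than taking a `now` argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuggySttTimestamps:
    """Hypothetical stand-in for the two timestamps tracked in STT mode."""

    last_speaking_time: Optional[float] = None
    last_final_transcript_time: Optional[float] = None

    def on_start_of_speech(self, now: float) -> None:
        self.last_speaking_time = now

    def on_final_transcript(self, now: float) -> None:
        self.last_final_transcript_time = now

    def on_end_of_speech(self, now: float) -> None:
        # The overwrite described above: the earlier value is lost.
        self.last_speaking_time = now

    def transcription_delay(self) -> float:
        if self.last_speaking_time is None or self.last_final_transcript_time is None:
            raise ValueError("both timestamps must be set")
        return max(self.last_final_transcript_time - self.last_speaking_time, 0.0)


# Made-up event times, in seconds.
state = BuggySttTimestamps()
state.on_start_of_speech(now=10.00)
state.on_final_transcript(now=11.50)
state.on_end_of_speech(now=11.52)
buggy_delay = state.transcription_delay()
```

With these illustrative timings, END_OF_SPEECH moves the reference timestamp past the transcript time, so the difference is negative and the reported delay is clamped to zero.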


Solution

Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode. The value was already correctly set at START_OF_SPEECH.

Comparison with VAD Mode:
VAD mode does NOT update _last_speaking_time at END_OF_SPEECH - it keeps the value from the last INFERENCE_DONE event. STT mode should follow the same pattern.

After Fix:

  1. START_OF_SPEECH → _last_speaking_time = time.time() (preserved)
  2. FINAL_TRANSCRIPT → _last_final_transcript_time = time.time()
  3. END_OF_SPEECH → No overwrite

Result: transcription_delay = max(last_final_transcript_time - last_speaking_time, 0) now correctly represents the actual transcription latency.
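
A regression check along the following lines could replay the same event sequence with and without the overwrite. It is a hedged sketch: `replay_transcription_delay`, its `overwrite_at_end_of_speech` flag, and the event times are hypothetical and are not part of the livekit-agents test suite. Note that in this sketch the interval starts at the START_OF_SPEECH timestamp, as in the timeline above.

```python
from typing import Optional


def replay_transcription_delay(
    events: list[tuple[str, float]], overwrite_at_end_of_speech: bool
) -> float:
    """Replay (event_type, timestamp) pairs and return the resulting delay."""
    last_speaking_time: Optional[float] = None
    last_final_transcript_time: Optional[float] = None
    for event_type, now in events:
        if event_type == "START_OF_SPEECH":
            last_speaking_time = now
        elif event_type == "FINAL_TRANSCRIPT":
            last_final_transcript_time = now
        elif event_type == "END_OF_SPEECH" and overwrite_at_end_of_speech:
            # Pre-fix behaviour: the speaking timestamp is replaced here.
            last_speaking_time = now
    if last_speaking_time is None or last_final_transcript_time is None:
        raise ValueError("timeline must contain START_OF_SPEECH and FINAL_TRANSCRIPT")
    return max(last_final_transcript_time - last_speaking_time, 0.0)


# Made-up event times, in seconds.
timeline = [("START_OF_SPEECH", 10.0), ("FINAL_TRANSCRIPT", 11.5), ("END_OF_SPEECH", 11.52)]
delay_before_fix = replay_transcription_delay(timeline, overwrite_at_end_of_speech=True)
delay_after_fix = replay_transcription_delay(timeline, overwrite_at_end_of_speech=False)
```

With the overwrite enabled the delay collapses to zero; with it disabled the same timeline yields a non-zero interval.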


Testing

All 15 existing agent session tests pass:

tests/test_agent_session.py::test_events_and_metrics PASSED
tests/test_agent_session.py::test_tool_call PASSED
tests/test_agent_session.py::test_interruption[False-5.5] PASSED
tests/test_agent_session.py::test_interruption[True-5.5] PASSED
tests/test_agent_session.py::test_interruption_options PASSED
tests/test_agent_session.py::test_interruption_by_text_input PASSED
tests/test_agent_session.py::test_interruption_before_speaking[False-3.5] PASSED
tests/test_agent_session.py::test_interruption_before_speaking[True-3.5] PASSED
tests/test_agent_session.py::test_generate_reply PASSED
tests/test_agent_session.py::test_preemptive_generation[True-0.8] PASSED
tests/test_agent_session.py::test_preemptive_generation[False-1.1] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[False-2.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-0.0] PASSED
tests/test_agent_session.py::test_interrupt_during_on_user_turn_completed[True-2.0] PASSED

======================== 15 passed in 75.96s ========================

Backward Compatibility

No breaking changes: this fix only corrects the metric calculation. The actual agent behavior (speech recognition, turn detection, interruption handling) is unchanged.

Expected Impact:

  • Users with STT turn detection will now see accurate transcription_delay values in their metrics
  • Dashboards showing this metric will now report correct latency (previously under-reported as ~0)

Edge Cases Handled

  1. No VAD present - Already handled at lines 376-382, falls back to STT timestamps
  2. Multiple speech segments - START_OF_SPEECH updates _last_speaking_time for each new segment
  3. Preflight transcripts - Also update _last_final_transcript_time correctly
  4. VAD mode unchanged - Fix only affects STT turn detection mode

Files Changed

livekit-agents/livekit/agents/voice/audio_recognition.py

  • Removed the buggy self._last_speaking_time = time.time() line from END_OF_SPEECH handler
  • Added explanatory comment documenting why we don't update the timestamp here

Related Issues

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of initial speech timestamps, resulting in more accurate latency measurements and more reliable end-of-turn detection during voice recognition.
    • Reduces premature or inconsistent end-of-utterance decisions, especially when voice activity detection is active.
    • No other user-visible behavior changes beyond improved timing and reliability.

@davidzhao
Member

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Chef's kiss.

@davidzhao davidzhao requested a review from longcw January 15, 2026 19:26
Member

@davidzhao davidzhao left a comment

@devbyteai please remove PR_DESCRIPTION.md from the commit

@devbyteai devbyteai force-pushed the fix/transcription-delay-stt-mode branch from ee893ae to 0e2ffb6 Compare January 21, 2026 09:09
@coderabbitai
Contributor

coderabbitai bot commented Jan 21, 2026

Walkthrough

Replaced sentinel checks for _last_speaking_time so END_OF_SPEECH and STT event paths compare against None and only set _last_speaking_time when VAD is disabled or it was previously None, preventing unintended overwrites of prior speaking timestamps.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| STT turn-detection timing (livekit-agents/livekit/agents/voice/audio_recognition.py) | Use None as the uninitialized sentinel for _last_speaking_time; change conditions in FINAL_TRANSCRIPT, PREFLIGHT_TRANSCRIPT, and END_OF_SPEECH paths to avoid overwriting an existing speaking timestamp when VAD is used. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I measured the hop, I kept the first chime,
No overwrite at ending, I saved the first time.
VAD gave its word, STT kept the tune,
Now latency sings true beneath the moon. 🕰️✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title correctly identifies the fix for issue #4388 regarding transcription_delay metric calculation in STT turn detection mode. |
| Linked Issues check | ✅ Passed | The pull request addresses the primary objective from #4388 by replacing the overwriting of _last_speaking_time at END_OF_SPEECH with None checks, ensuring transcription_delay accurately reflects the interval between speaking and transcription. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope; the PR only modifies sentinel value checks from 0 to None in STT event handling to correct transcription_delay calculation as required by #4388. |

📜 Recent review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0e2ffb6 and 0cee66e.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (1)
  • livekit-agents/livekit/agents/voice/audio_recognition.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Format code with ruff
Run ruff linter and auto-fix issues
Run mypy type checker in strict mode
Maintain line length of 100 characters maximum
Ensure Python 3.9+ compatibility
Use Google-style docstrings

Files:

  • livekit-agents/livekit/agents/voice/audio_recognition.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: unit-tests
🔇 Additional comments (3)
livekit-agents/livekit/agents/voice/audio_recognition.py (3)

376-382: Correct sentinel check for preserving VAD timing.

The change from == 0 to is None properly matches the type annotation (float | None) and initialization. The conditional logic now correctly:

  • Preserves VAD-provided _last_speaking_time when VAD is enabled
  • Falls back to STT timing when VAD is disabled or hasn't fired yet
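
The difference between the two sentinel checks can be seen in isolation. The functions below are a hypothetical sketch of the condition's shape only (`should_fall_back_old` and `should_fall_back_new` are illustrative names, and `vad_enabled` stands in for the truthiness of `self._vad`); they are not the code in audio_recognition.py.

```python
from typing import Optional


def should_fall_back_old(vad_enabled: bool, last_speaking_time: Optional[float]) -> bool:
    # Pre-change shape of the check: None == 0 is False, so an unset timestamp is missed.
    return (not vad_enabled) or last_speaking_time == 0


def should_fall_back_new(vad_enabled: bool, last_speaking_time: Optional[float]) -> bool:
    # Post-change shape of the check: matches the None initial value.
    return (not vad_enabled) or last_speaking_time is None


# VAD enabled, but it has not produced a timestamp yet.
old_unset = should_fall_back_old(True, None)
new_unset = should_fall_back_new(True, None)

# VAD enabled, and it already recorded a timestamp (made-up value).
old_set = should_fall_back_old(True, 12.3)
new_set = should_fall_back_new(True, 12.3)
```

Only the `is None` variant detects the unset case while VAD is enabled; both variants leave an existing VAD timestamp alone.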

428-430: Consistent sentinel check in PREFLIGHT_TRANSCRIPT handler.

Matches the pattern used at line 376, ensuring VAD timing is preserved here as well.


446-456: Core fix: END_OF_SPEECH no longer unconditionally overwrites the speaking timestamp.

This directly addresses issue #4388. Previously, END_OF_SPEECH (which can arrive immediately after FINAL_TRANSCRIPT in STT mode) was overwriting _last_speaking_time, causing transcription_delay ≈ 0.

With this fix:

  • When VAD is present: the timestamp from VAD's INFERENCE_DONE is preserved
  • When VAD is absent: STT timing is used as the fallback (with acknowledged limitations per the TODO at line 378)

This aligns STT mode behavior with VAD mode, which does not overwrite _last_speaking_time at END_OF_SPEECH.
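
Going by the walkthrough's description (set the timestamp only when VAD is disabled or no value was recorded yet), the END_OF_SPEECH update might look roughly like the sketch below. `speaking_time_after_end_of_speech` is a hypothetical helper written for illustration, with made-up timestamps; it is an assumption about the logic, not the merged code.

```python
from typing import Optional


def speaking_time_after_end_of_speech(
    current: Optional[float], now: float, vad_enabled: bool
) -> float:
    """Return the speaking timestamp to keep once END_OF_SPEECH arrives."""
    if not vad_enabled or current is None:
        # No VAD, or nothing recorded yet: fall back to the STT event time.
        return now
    # VAD already provided a timestamp: preserve it.
    return current


# VAD present: the earlier timestamp survives END_OF_SPEECH.
vad_value = speaking_time_after_end_of_speech(current=11.0, now=11.52, vad_enabled=True)

# VAD absent: the STT event time is used as the fallback.
no_vad_value = speaking_time_after_end_of_speech(current=11.0, now=11.52, vad_enabled=False)
```

In the VAD-absent branch the timestamp still moves to the END_OF_SPEECH time, which matches the limitation the review comment attributes to the TODO at line 378.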

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@livekit-agents/livekit/agents/voice/audio_recognition.py`:
- Around line 452-453: The condition uses `self._last_speaking_time == 0` but
`_last_speaking_time` is initialized to None and reset to None, so replace
comparisons to 0 with explicit None checks; update the three spots in
audio_recognition.py where you see `if not self._vad or self._last_speaking_time
== 0` (and similar at the other two locations) to `if not self._vad or
self._last_speaking_time is None` so START_OF_SPEECH only sets the timestamp
when it truly hasn't been set, and END_OF_SPEECH/other branches behave
correctly; ensure you update all occurrences that reference
`_last_speaking_time` in the relevant methods to use `is None`.
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 1f4fcef and 0e2ffb6.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: unit-tests
  • GitHub Check: type-check (3.13)
  • GitHub Check: type-check (3.9)

…tion mode

Fixes livekit#4388

Remove the line that overwrites _last_speaking_time at END_OF_SPEECH in STT mode.
This was causing transcription_delay to always be ~0 since END_OF_SPEECH typically
arrives after FINAL_TRANSCRIPT, making both timestamps nearly identical.
@devbyteai devbyteai force-pushed the fix/transcription-delay-stt-mode branch from 0e2ffb6 to 0cee66e Compare January 21, 2026 09:21
Member

@davidzhao davidzhao left a comment

lg!

@davidzhao davidzhao merged commit 73eb202 into livekit:main Jan 21, 2026
10 checks passed
Development

Successfully merging this pull request may close these issues.

Incorrect transcription_delay when using STT turn detection mode

3 participants